List of AI News about SWE Bench
| Time | Details |
|---|---|
| 2026-06-11 10:38 | Claude Fable 5 Breakthrough hits 80.3% SWE-bench<br>According to AINewsOfficial_, Claude Fable 5 posts 80.3% on SWE-bench Pro and adds 1M context with 128k output for multi-day autonomy, surpassing rivals. |
| 2026-05-19 08:04 | Claude Opus 4.7 Regression Sparks Dev Backlash<br>According to @godofprompt, Opus 4.7 ignores project instructions and skips MCP configs; Anthropic acknowledged regressions versus 4.6 despite higher benchmarks. |
| 2026-05-11 08:38 | Kimi K2.6 Disrupts Claude with 1/6 price<br>According to @_avichawla, Kimi K2.6 matches Claude’s chat, code, and cowork at 1/6 the price, ranks #1 on OpenRouter, and posts 58.6 on SWE-Bench Pro. |
| 2026-05-09 22:15 | Claude Opus 4.7 Boosts SWE-bench to 87.6%<br>According to @godofprompt, Claude Opus 4.7 follows instructions literally, lifts its SWE-bench score to 87.6% from 80.8%, and breaks prompts tuned for 4.6. |
| 2026-04-09 18:28 | Claude Sonnet Plus Opus Advisor Boosts SWE-bench Multilingual by 2.7 Points at 11.9% Lower Cost: Latest Evaluation Analysis<br>According to @claudeai on Twitter, Sonnet paired with an Opus advisor achieved a 2.7 percentage point higher score on SWE-bench Multilingual than Sonnet alone while reducing per-task cost by 11.9%. As reported by the Claude account post, this advisor-enhanced workflow indicates measurable quality gains and cost efficiency in multilingual software engineering benchmarks. For AI product teams, the data suggests a practical orchestration strategy: route primary reasoning to Sonnet and use Opus selectively for guidance to improve pass rates and lower run-time spending. According to the tweet, these results come from evals on SWE-bench Multilingual, highlighting a repeatable method for cost-aware performance optimization in LLM-based coding assistants. |
| 2026-02-27 12:10 | MiniMax M2.5 Beats Opus 4.6 on SWE-Bench Verified: 80.2% Score, 3x Faster, $1 per Hour - AI Coding Benchmark Analysis<br>According to God of Prompt on X (Twitter), MiniMax M2.5 surpassed Opus 4.6 on the SWE-Bench Verified benchmark with an 80.2% score, delivers roughly 3x faster execution, and is offered at a flat $1 per hour, while using only 10B activated parameters, positioning it as the smallest Tier-1 model for coding tasks. As reported by the same source, these metrics imply lower latency and significantly reduced inference cost, enabling 24/7 autonomous coding agents and continuous integration bots at practical budgets. According to the post, the combination of high benchmark accuracy and small active parameter count suggests strong efficiency-per-dollar, which can improve ROI for software teams deploying code assistants, test repair bots, and maintenance agents in production pipelines. |
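
The orchestration strategy described in the 2026-04-09 entry (a primary model does the work and a stronger model is consulted selectively) could look roughly like the sketch below. This is an illustration only: the source does not say how or when the advisor is invoked, so the "consult after a failed check" trigger is an assumption, and `solve_with_advisor`, `primary`, `advisor`, and `check` are hypothetical names, not part of any vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Attempt:
    patch: str          # the final candidate patch
    passed: bool        # whether the final patch passed the check
    advisor_calls: int  # how many times the stronger model was consulted


def solve_with_advisor(
    task: str,
    primary: Callable[[str, Optional[str]], str],  # hypothetical: (task, hint) -> patch
    advisor: Callable[[str, str], str],            # hypothetical: (task, failed patch) -> hint
    check: Callable[[str], bool],                  # hypothetical: patch -> passes?
    max_advice_rounds: int = 1,
) -> Attempt:
    """Let the primary model attempt the task; ask the advisor only on failure."""
    patch = primary(task, None)  # first attempt, no guidance
    if check(patch):
        return Attempt(patch, True, 0)

    calls = 0
    for _ in range(max_advice_rounds):
        hint = advisor(task, patch)  # assumed trigger: the previous attempt failed
        calls += 1
        patch = primary(task, hint)  # retry with the advisor's guidance
        if check(patch):
            return Attempt(patch, True, calls)
    return Attempt(patch, False, calls)
```

Capping `max_advice_rounds` bounds how often the more expensive model is called per task. Whether that reproduces the reported cost and score figures depends on details the post does not give.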